ABSTRACT

The gap between data production and users' ability to access,
compute and produce meaningful results calls for tools that
address the challenges associated with big data volume,
velocity and variety. One of the key hurdles is the inability
to methodically remove expected or uninteresting elements
from large data sets. This difficulty often wastes valuable
researcher and computational time by expending resources on
uninteresting parts of the data. Social sensors, or sensors which
produce data based on human activity, such as Wikipedia,
Twitter, and Facebook, have an underlying structure which
can be thought of as having a power law distribution. Such
a distribution implies that a few nodes generate large amounts
of data. In this article, we propose a technique to take an
arbitrary dataset and compute a power law distributed
background model that bases its parameters on observed statistics.
This model can be used to determine the suitability of a
power law model or to automatically identify high degree nodes
for filtering, and it can be scaled to work with big data.

Index Terms — Big Data, Signal Processing, Power Law 

1. INTRODUCTION 

The 3 V's of big data (volume, velocity and variety) provide
a guide to the outstanding challenges associated with
working with big data systems. Big data volume stresses the
storage, memory and compute capacity of a computing
system and requires access to a computing cloud. The velocity
of big data stresses the rate at which data can be absorbed
and meaningful answers produced. Big data variety makes
it difficult to develop algorithms and tools that can address
the large diversity of input data. One of the key challenges
is in developing algorithms that can provide analysts with
basic knowledge about their dataset when they have little to no
knowledge of the data itself.

In previous work, we proposed a series of steps that an analyst
should take in analyzing an unknown dataset, including a technique
similar to spectral analysis: Dimensional Data Analysis
(DDA). As a next step in the data analysis pipeline, we
propose determining a suitable background model. Big data sets
come from a variety of sources such as social media, health
care records, and deployed sensors, and the background
models for such datasets are as varied as the data itself. One of
the popular statistical distributions used to explain such data
is the power law distribution. There is much related work in
the area. Several studies have looked at power
laws as underlying models for human generated big data
collected from sources such as social media, network activity
and census data. However, there has also been some
controversy: data that looks like it follows a power law
may in fact be better described by other distributions such
as exponential or log-normal distributions. While a
power law distribution may seem a fitting background model
for an observed dataset, large fluctuations in lower degree
terms (the tail of the distribution) may skew the estimation of
power law parameters. Further, the estimation of the power
law exponent can be heavily dependent on decisions such as
binning, which may lead to problems such as estimator
bias.

We propose a technique that takes an unknown dataset,
estimates the parameters and binning of the degree distribution
of a power law distributed dataset that follows constraints
enforced by the observed dataset, and aligns the degree
distribution of the observed dataset to the structure of the perfect
power law distribution in order to provide a clear view into
the applicability of a power law model.

The article is structured as follows: Section 2 describes
the relationship between signal processing and big data;
Section 3 describes the proposed technique to estimate the power
law parameters of a dataset; Section 4 describes application
examples. Finally, we conclude in Section 5.

2. SIGNAL PROCESSING AND BIG DATA 

Detection theory in signal processing concerns the ability to
discern between different signals based on their statistical
properties. In the simplest form, the decision is
to choose whether a signal is present or absent. In making
this determination, the observed signal is compared against
the expected background signal when no signal is present. A
deviation from this background model indicates the presence
of a signal. While it is common to represent the background
model as additive white Gaussian noise (AWGN), the model
may change depending on physical factors such as channel
parameters or known noise characteristics.

Big data can be considered a high dimensional signal
that is projected to an n-dimensional space. Big data,
similar to the 1-, 2- and 3-D signals that signal processing has
traditionally dealt with, are often noisy, corrupted, or
misaligned. The concept of noise in big data can be thought of as
unwanted information which impairs the detection of
activity or important entities present in the dataset. For example,
consider a situation in which network logs are collected to
determine foreign or dangerous connections out of a network.
Detecting such activity may be difficult due to the presence of
a few vertices (such as connections to www.google.com) with
a very large number of connections (edges). This information
can be considered the equivalent of stop words in text
analytics. While such data may help form a useful statistic,
these entries often impair the ability to find activity of
interest that occurs at a lower activity threshold. Often, such
vertices with a large number of edges (high degree vertices)
are manually removed based on empirical evidence in order
to improve the big data signal to noise ratio. Knowledge of
a suitable big data background model can highlight such
vertices and help with the automated removal of components in
the dataset which are of minimal interest. This concept
parallels the concept of filtering in signal processing. In fact, such
parallels between big data and signal processing kernels are
numerous. For example, prior work has examined the
commonality between certain signal processing kernels and graph
operations. As another example, other authors study
data collected from social networks and their underlying
statistical distribution.

A common distribution that is observed in many datasets
is the power law distribution. A power law distribution is one
in which a small number of vertices are incident to a large
number of edges. This principle has also been referred to as
the Pareto principle, Zipf's law, or the 80-20 rule. A power
law distribution for a random variable $x$ is defined to be:

$$p(x) = c x^{-\alpha}$$

where the exponent $\alpha$ represents the power of the
distribution. An illustrative example of 10,000 randomly
generated points drawn from a power law distribution with
exponent $\alpha = 1.8$ is shown in Figure 1. The middle panel shows
the histogram of such a random variable. Finally, the
rightmost panel shows the degree distribution of the signal. Note
the large number of low degree nodes and small number of
high degree nodes. While a power law distribution may be
applicable to a given dataset based on physical characteristics
or empirical evidence, a more rigorous approach is needed to
fit observed data to a power law model in order to verify the
applicability of a power law distribution.
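
The kind of data shown in Figure 1 can be approximated with a short sketch. This is only an illustration under stated assumptions: it uses NumPy's discrete Zipf sampler as a stand-in for "drawing from a power law distribution" (the paper does not say which sampler was used), and the helper name `degree_distribution` and the seed are hypothetical.

```python
import numpy as np

def degree_distribution(values):
    """Return (degrees, counts): how many samples take each distinct value."""
    degrees, counts = np.unique(values, return_counts=True)
    return degrees, counts

# Assumption: a discrete Zipf draw, p(x) proportional to x^(-alpha), stands in
# for the power law generator; the seed is arbitrary.
rng = np.random.default_rng(seed=0)
alpha = 1.8
samples = rng.zipf(alpha, size=10_000)

degrees, counts = degree_distribution(samples)
# Many samples share small values; only a few samples reach very large values.
print("smallest degrees:", degrees[:5])
print("their counts:    ", counts[:5])
print("largest degree:  ", degrees[-1])
```

Plotting `counts` against `degrees` on log-log axes gives a view comparable to the rightmost panel described above.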


3. POWER LAW MODELING TECHNIQUE 

This section describes the proposed technique for
comparing an observed dataset with an ideal power law distributed
dataset whose parameters are derived from a statistical
analysis of the observed dataset.

3.1. Definitions 

A large dataset can be represented by a graph through the
adjacency matrix representation or incidence matrix
representation. An adjacency matrix has dimensions n x n, where
n corresponds to the number of vertices in the graph. A
vertex out-degree is a count of the number of edges in a directed
graph which leave a particular vertex. The vertex in-degree,
in contrast, is a count of the number of edges in a directed
graph which enter a particular vertex. A popular way to
represent the in-degrees and out-degrees is through
the degree distribution, which is a statistic that counts the
number of vertices that are of a certain degree. Such a count
is very relevant to areas such as graph algorithms and
social networks.

The in-degree and out-degree of a graph can be
determined by the following relations:

$$d_{in}(i) = \sum_j E_{ij}$$

$$d_{out}(i) = \sum_j E_{ji}$$

where $d_{in}(i)$ and $d_{out}(i)$ represent the in and out degree at bin
$i$ and $E$ represents the graph incidence matrix. The selection
of the total number of bins ($N_d$) is an important factor in
determining the degree distribution given by $d_{in}$, $d_{out}$, and
$i \le N_d$. An illustrative example is shown in Figure 2 to
demonstrate the parameters described in this section.

The degree distribution conveys many important pieces of 
information. For example, it may show that a large number of 
vertices have a degree of 1, which implies that a majority of 
vertices are connected only to one other vertex. Further, it 
may show that there are a small number of vertices with high 
degree (many edges). In a social media dataset, such vertices 
may correspond to popular users or items. 
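
As a minimal sketch of these definitions, the in-degree and out-degree distributions can be computed from a small adjacency matrix. The function name `degree_distributions` and the toy matrix are hypothetical; the code assumes the convention stated for the adjacency matrix in Section 3.2 (rows hold outbound edges, columns hold inbound edges) and leaves out vertices of degree 0.

```python
import numpy as np

def degree_distributions(A):
    """Return ((d_in, n_in), (d_out, n_out)) for adjacency matrix A.

    Assumed convention: A[i, j] != 0 means an edge from vertex i to vertex j.
    """
    present = np.asarray(A) != 0
    out_degree = present.sum(axis=1)  # edges leaving each row vertex
    in_degree = present.sum(axis=0)   # edges entering each column vertex

    def histogram(deg):
        # n(d): number of vertices with degree d, skipping degree 0
        return np.unique(deg[deg > 0], return_counts=True)

    return histogram(in_degree), histogram(out_degree)

# Toy graph: 0->1, 0->2, 0->3, 1->2, 2->3
A = np.array([[0, 1, 1, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
(in_d, in_n), (out_d, out_n) = degree_distributions(A)
print("in-degree bins", in_d, "counts", in_n)
print("out-degree bins", out_d, "counts", out_n)
```

For this toy graph, vertex 0 plays the role of the high degree vertex in the out-degree distribution.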

The maximum degree is denoted $d_{N_d} = d_{max}$.
The total number of vertices ($N$) and edges ($M$) can be
computed as:

$$N = \sum_{i=1}^{N_d} n(d_i)$$

$$M = \sum_{i=1}^{N_d} n(d_i) * d_i$$







Fig. 1: Computing the histogram and degree distribution of 10,000 data points in which the magnitude was determined by
drawing from a power law distribution with $\alpha = 1.8$.

Fig. 2: Adjacency matrix for notional power law distributed
random graph with corresponding in and out degree
distributions.

where $n(d_i)$ is defined as the number (count) of vertices with
degree $d_i$.
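
A small worked example of the two sums, using a hypothetical three-bin degree distribution (the values and the helper name `totals` are illustrative only):

```python
import numpy as np

def totals(d, n):
    """Return (N, M): N = sum of n(d_i), M = sum of n(d_i) * d_i."""
    d = np.asarray(d)
    n = np.asarray(n)
    return int(n.sum()), int((n * d).sum())

# Hypothetical distribution: 16 vertices of degree 1, 4 of degree 2, 1 of degree 4
d = [1, 2, 4]
n = [16, 4, 1]
N, M = totals(d, n)
print(N, M)  # N = 16 + 4 + 1, M = 16*1 + 4*2 + 1*4
```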

3.2. Power Law Fitting 

Step 1: Find parameters of observed data 

In order to determine the power law parameters for an
arbitrary data set, data is first converted to an adjacency
matrix representation. In the adjacency matrix, rows and columns
represent vertices with outbound and inbound edges,
respectively. A non-zero entry at a particular row and column pair
indicates the existence of an edge between two vertices.
Often, data may be collected and stored in an incidence matrix
where rows of the matrix represent edges and columns
represent vertices. For data in an intermediate format such as the
incidence matrix ($E$), it is possible to convert this
representation to the adjacency matrix ($A$) using the following relation:

$$A = |E < 0|^{T} * |E > 0|$$

where $|E < 0|$ and $|E > 0|$ represent the incidence matrix
with outbound and inbound edges only. From the adjacency
matrix, it is possible to calculate the degree distribution as
described and to extract the parameters $\alpha$, the
maximum degree $d_{max}$, the number of vertices with exactly 1
edge, $n(d_1)$, and the number of bins $N_d$. There are many
proposed methods to calculate the power law exponent.
For the purpose of this study, a simple first order estimate of $\alpha$
is sufficient, and it should satisfy the intuitive property that the
count $n(d_1)$ and degree $d_{max}$ be included, since most natural
datasets will have at least one vertex with degree 1 and a
vertex with large degree. Furthermore, the exponent should
take into account. Therefore, we propose the following
simple relationship to calculate $\alpha$:

$$\alpha = \log(n(d_1)) / \log(d_{max}) \qquad (1)$$
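
One way to read Equation (1), offered here as an interpretation rather than the authors' stated derivation, is to assume that the degree counts follow $n(d) = c\,d^{-\alpha}$, that the smallest degree is $d_1 = 1$, and that exactly one vertex attains $d_{max}$:

```latex
\begin{align*}
n(d_1) &= c \cdot 1^{-\alpha} = c, \\
n(d_{max}) &= c \, d_{max}^{-\alpha} = 1
  \;\Rightarrow\; n(d_1) = d_{max}^{\alpha}, \\
\alpha &= \frac{\log n(d_1)}{\log d_{max}}.
\end{align*}
```

Under these assumptions the estimate depends only on the two end points of the degree distribution, which is why it is a first order estimate.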

Using the power law exponent calculated in Equation (1)
allows an initial comparison between the observed dataset and
a power law distribution.
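
Step 1 can be sketched in a few lines. This is a hedged illustration, not the authors' D4M implementation: the function names and the toy incidence matrix are hypothetical, negative entries are assumed to mark the source vertex of an edge and positive entries the destination (matching the description of $|E < 0|$ and $|E > 0|$ above), and the choice to estimate the exponent from the in-degrees is an assumption.

```python
import numpy as np

def adjacency_from_incidence(E):
    """A = |E < 0|^T * |E > 0| for an (edges x vertices) incidence matrix E."""
    E = np.asarray(E)
    E_out = (E < 0).astype(int)  # 1 where the edge leaves a vertex
    E_in = (E > 0).astype(int)   # 1 where the edge enters a vertex
    return E_out.T @ E_in        # (vertices x vertices)

def estimate_alpha(degrees):
    """First order estimate from Equation (1): log(n(d_1)) / log(d_max)."""
    degrees = np.asarray(degrees)
    degrees = degrees[degrees > 0]
    n_d1 = np.count_nonzero(degrees == 1)
    d_max = degrees.max()
    if n_d1 < 1 or d_max <= 1:
        raise ValueError("need a degree-1 vertex and a maximum degree above 1")
    return np.log(n_d1) / np.log(d_max)

# Toy incidence matrix, one row per edge: 0->1, 0->2, 0->3, 1->2
E = np.array([[-1,  1, 0, 0],
              [-1,  0, 1, 0],
              [-1,  0, 0, 1],
              [ 0, -1, 1, 0]])
A = adjacency_from_incidence(E)
in_degree = (A != 0).sum(axis=0)   # column sums count inbound edges
alpha = estimate_alpha(in_degree)
print(A)
print("alpha estimate:", alpha)
```

On a graph this small the estimate is not meaningful; the example only shows the mechanics of the conversion and of Equation (1).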

Step 2: Calculate “Perfect” Power Law Parameters 

Given the degree distribution of the observed data, we can
compute its parameters using the relations provided in the
definitions section (Section 3.1). In order to see if the observed
data fits a power law distribution, we need to be able to determine what
an ideal power law distributed dataset would look like for
parameters similar to those observed. This ideal distribution is
referred to as a “perfect” power law distribution.

The “perfect” power law distribution can be determined
by computing the parameters $d_i$, $n(d_i)$, and $N_d$ which closely
fit the observed data while also maintaining the number of
vertices and edges. While, theoretically, any distribution
which satisfies the properties $\alpha > 0$, $d_{max} > 1$ and [...]
can be used to form a power law model, we also desire [...]
which maintain the total number of vertices and edges, $N$
and $M$. Essentially, given an observed number of vertices
and edges, compute the quantities $N_d$, $n(d_i)$ and $d_i$ for
$i \le N_d$ that form a power law distribution (with [...]) and
also satisfy the property that $M \approx M^{obs}$ and $N \approx N^{obs}$.
These values can be solved for by using a combination of
optimization techniques such as exhaustive search, simulated
annealing, or Broyden's algorithm, to find the values of $d_i$ and
$n(d_i)$ that minimize:


$$\min_{d_i,\, n(d_i)} f(d_i, n(d_i))$$

$$f(d_i, n(d_i)) = \left| N^{obs} - \sum_i n(d_i) \right|^2 + \left| M^{obs} - \sum_i n(d_i) * d_i \right|^2$$


where $M^{obs}$ and $N^{obs}$ are the observed number of edges and
vertices. From the estimate of $d_i$ and $n(d_i)$ we can determine
$N_d$ (given by the number of output $d_i$), and $d_{max}$ (given by
$d_{N_d}$).
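
A deliberately simplified sketch of Step 2 is given below. It is not the authors' optimizer: instead of searching over $d_i$ and $n(d_i)$ directly, it assumes every integer degree from 1 to $d_{max}$ is a bin, sets $n(d) = \mathrm{round}((d_{max}/d)^{\alpha})$ so that $n(d_{max}) = 1$, and runs an exhaustive search over a small hypothetical grid of $(\alpha, d_{max})$ pairs, scoring each with a squared-difference objective on the vertex and edge totals. All function names and grid values are illustrative.

```python
import numpy as np

def perfect_power_law(alpha, d_max):
    """Bins d = 1..d_max with n(d) = round((d_max / d)**alpha), so n(d_max) = 1."""
    d = np.arange(1, d_max + 1)
    n = np.round((d_max / d) ** alpha).astype(int)
    return d, n

def objective(d, n, N_obs, M_obs):
    """Squared mismatch in vertex count plus squared mismatch in edge count."""
    return (N_obs - n.sum()) ** 2 + (M_obs - (n * d).sum()) ** 2

def fit_perfect_power_law(N_obs, M_obs, alphas, d_maxes):
    """Exhaustive search over (alpha, d_max); returns the lowest-cost candidate."""
    best = None
    for alpha in alphas:
        for d_max in d_maxes:
            d, n = perfect_power_law(alpha, d_max)
            cost = objective(d, n, N_obs, M_obs)
            if best is None or cost < best[0]:
                best = (cost, alpha, d_max, d, n)
    return best

# Synthetic "observed" totals taken from a known model, so the search can recover them
d_true, n_true = perfect_power_law(1.5, 50)
N_obs, M_obs = int(n_true.sum()), int((n_true * d_true).sum())

cost, alpha, d_max, d, n = fit_perfect_power_law(
    N_obs, M_obs, alphas=[1.0, 1.25, 1.5, 1.75, 2.0], d_maxes=range(10, 101, 10)
)
print("cost:", cost, "alpha:", alpha, "d_max:", d_max, "N_d:", len(d))
```

Exhaustive search is only practical for a small grid like this one; the text above also lists simulated annealing and Broyden's algorithm as options for larger searches.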

Step 3: Align observed data with background model 

The values of $N_d$, $d_i$, and $n(d_i)$ from the previous step
provide a power law distributed dataset with power $\alpha$.
However, the degree binning may be different from the observed
distribution. In order to compare the observed data with the
background model, it is necessary to rebin the observed data
such that it aligns with the background model. Using the
rebinned observed data (represented by the parameters $d_i^{rebin}$
and $n(d_i^{rebin})$), it is possible to determine the power law
nature of the observed dataset. After applying Algorithm 1, both
datasets use the same degree binning.

Data: $d_i$, $d^{obs}$

Result: $d_i^{rebin}$, $n(d_i^{rebin})$

for i = 1:N_d do

    $d_i^{rebin} = d_i$

    $n(d_i^{rebin})$ = # of vertices binned to i

end

Algorithm 1: Algorithm to rebin observed data into fitted
data bins
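
Algorithm 1 can be sketched as follows. The binning rule is an assumption, since the text does not spell it out: each observed degree is assigned to the first fitted bin $d_i$ that is greater than or equal to it, and observed degrees above the largest fitted bin are placed in the last bin. The function name `rebin` and the sample values are hypothetical.

```python
import numpy as np

def rebin(d_fit, d_obs, n_obs):
    """Rebin an observed degree distribution onto the fitted bins d_fit.

    d_fit : ascending degree bins of the background model
    d_obs : observed degrees
    n_obs : number of vertices observed at each degree in d_obs
    Returns (d_rebin, n_rebin), where d_rebin equals d_fit.
    """
    d_fit = np.asarray(d_fit)
    d_obs = np.asarray(d_obs)
    n_obs = np.asarray(n_obs)
    # Index of the first fitted bin >= each observed degree (assumed rule)
    idx = np.searchsorted(d_fit, d_obs, side="left")
    # Observed degrees beyond the largest fitted bin go into the last bin
    idx = np.clip(idx, 0, len(d_fit) - 1)
    n_rebin = np.bincount(idx, weights=n_obs, minlength=len(d_fit)).astype(int)
    return d_fit, n_rebin

d_fit = [1, 2, 4, 8]
d_obs = [1, 2, 3, 4, 5, 9]
n_obs = [10, 5, 3, 2, 1, 1]
d_rebin, n_rebin = rebin(d_fit, d_obs, n_obs)
print(d_rebin, n_rebin)
```

Rebinning preserves the total number of vertices, so the rebinned counts can be compared bin by bin with $n(d_i)$ from the background model.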


4. APPLICATION EXAMPLE 

To demonstrate the application of the steps provided in Section
3.2, we describe two examples: a Twitter dataset and a
corpus of news articles provided by Reuters. We use the open
source Dynamic Distributed Dimensional Data Model (D4M,
d4m.mit.edu) to store and access the required data. As a first
step, data is converted to the D4M schema, which
organizes data into an associative array, representing data as an
incidence matrix. The Twitter dataset contains all the
metadata associated with approximately 2 million tweets. For the
purpose of this example, we have considered only a subset of
the data which corresponds to Twitter usernames.

Fig. 3: Fitting a power law distribution to Twitter user data
(panel title: User Entity Degree Distribution).
To begin, we determine the adjacency matrix of the
associative array data using the relation outlined in the previous
section. With the adjacency matrix, we can determine the
degree distribution of the observed dataset, shown as
the blue circles in Figure 3. Using the values of $N$ and $M$ from
this distribution, we can find the values of $d_i$ and $n(d_i)$ that
fit the $N$ and $M$ values of the observed dataset. The obtained
values are plotted in Figure 3 as the black triangles. As a final
step, we rebin the original degree distribution to align with
the bins of the ideal distribution. The results of rebinning the
observed distribution are shown in Figure 3 as red plus signs.

For the Twitter user data of Figure 3, we see that a power
law distribution provides a good representation of the data.
The second dataset, a corpus of news articles from Reuters,
seems at first to follow a power law distribution. However, once we
fit the perfect power law distribution and rebin the original
data, we see that the dataset does not follow a power law
distribution, as evidenced by the large bulge in the corresponding figure.

5. CONCLUSIONS AND FUTURE WORK 

In this article, we presented a technique to uncover the
underlying distribution of a big dataset. One of the most common
statistical distributions attributed to a variety of human
generated big data sources, such as social media, is the power law
distribution. Often, however, data that seems to adhere to a
power law distribution may not be well described by such a
distribution. In such situations, it is important to be aware of
the underlying background model of the dataset before
further processing. Our future work includes investigating the
big data equivalents of sampling and big data filtering.